Conversation

@ikawrakow (Owner) commented Nov 11, 2025

The DeepSeek self-attention mechanism is quite different from that of other models, so merging the "Q" and "K" model tensors is much trickier than it is for standard self-attention. But I was curious to see whether it could be done, and this PR shows that it is possible.

For DeepSeek-Lite fully offloaded, this gives a 1.5-2% improvement in TG performance.

I cannot test with the larger siblings (R1/V3/Kimi-K2), so I'm not sure I haven't broken something: there is one additional matrix multiplication involved, and it is easy to make a mistake with the views into the result of the merged matrix multiplication.

As with other Q/K/V merges, enabling this will disable mmap.

The option is disabled by default and is enabled with -mqkv.
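
For readers trying to follow what the merged projection does, here is a minimal, hypothetical sketch of the general pattern. It is not the PR's actual code; names such as w_qkv, n_q_out, and n_kv_out are made up for illustration. The idea is that the two projection weights are concatenated along the output dimension at load time, a single ggml_mul_mat then replaces the two separate multiplications, and the Q and KV parts are recovered as strided views into the result, which is exactly the step where the offsets are easy to get wrong.

#include "ggml.h"

// Hypothetical sketch of a merged projection split via views (illustrative names).
// w_qkv is assumed to hold the Q and KV projection weights concatenated along
// the output dimension at model load time: ne = [n_embd, n_q_out + n_kv_out].
static void build_merged_qk_example(struct ggml_context * ctx,
                                    struct ggml_tensor  * w_qkv,
                                    struct ggml_tensor  * cur,   // activations, ne = [n_embd, n_tokens]
                                    int64_t n_q_out, int64_t n_kv_out,
                                    struct ggml_tensor ** q_out,
                                    struct ggml_tensor ** kv_out) {
    const int64_t n_tokens = cur->ne[1];

    // One matrix multiplication instead of two, i.e. one kernel launch saved.
    // Each result row is laid out as [ Q part | KV part ].
    struct ggml_tensor * qkv = ggml_mul_mat(ctx, w_qkv, cur);  // ne = [n_q_out + n_kv_out, n_tokens]

    // Q part: the first n_q_out values of every row. The view keeps the original
    // row stride (qkv->nb[1]) so that it skips over the KV part of each row.
    *q_out = ggml_view_2d(ctx, qkv, n_q_out, n_tokens, qkv->nb[1], 0);

    // KV part: the remaining n_kv_out values, offset by n_q_out elements (in bytes).
    *kv_out = ggml_view_2d(ctx, qkv, n_kv_out, n_tokens, qkv->nb[1],
                           n_q_out*ggml_element_size(qkv));

    // Note: these views are not contiguous; depending on what consumes them,
    // a ggml_cont() may be required.
}

The actual DeepSeek graph is more involved, and which tensors get merged there is not shown in this thread; the sketch only illustrates the view-splitting step that the description above calls easy to get wrong.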

@Ph0rk0z commented Nov 11, 2025

Didn't see much of a boost or any negative side effects. Tested IQ2 V3.

@ikawrakow (Owner, Author)

@Ph0rk0z

Thanks for testing. Did you use -mqkv in your testing?

@calvin2021y

For short context and CPU-only, with Kimi-K2-Thinking-UD-Q4_K_XL I get 12.7 t/s without -mqkv and 12.6 t/s with -mqkv. Do I need to use -ctk q8_0?

@ikawrakow (Owner, Author)

Do I need to use -ctk q8_0?

When running CPU-only, -ctk q8_0 tends to improve performance, with the benefit increasing with context length.
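
For example (hypothetical model path and test sizes, not taken from this thread), a CPU-only llama-bench run with a quantized K-cache just adds -ctk q8_0, plus -mqkv 1 to exercise this PR's option:

./bin/llama-bench -m /path/to/model.gguf -t 48 -ctk q8_0 -mqkv 1 -p 512,2048,8192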

@calvin2021y

This PR should improve performance even without -ctk q8_0?

In my Zen4 CPU tests, -ctk q8_0 slows down t/s for short contexts.

I will try longer contexts, e.g. over 10K.

@ikawrakow (Owner, Author)

Yes, the change in performance should not depend on the KV cache type.

But I'm surprised your Zen4 CPU shows lower performance with a q8_0 K-cache. What is the CPU and how many threads are you using?

@calvin2021y

An EPYC 9454P, with --threads 48 --threads-batch 96.

@ikawrakow (Owner, Author)

Does --threads-batch 96 improve PP performance compared to --threads-batch 48?

@calvin2021y

Thanks for the tips. Testing this PR without -mqkv:

--threads 48 --threads-batch 96: 73.8 t/s PP, 11.3 t/s TG
--threads 48 --threads-batch 48: 79.0 t/s PP, 11.5 t/s TG

With -mqkv:

--threads 48 --threads-batch 96: 73.7 t/s PP, 11.3 t/s TG
--threads 48 --threads-batch 48: 78.8 t/s PP, 11.5 t/s TG

@Ph0rk0z commented Nov 12, 2025

Yep, I used -mqkv.

On the other topic, with my Xeons: I too see slightly higher PP in llama-bench when using the hyperthreads, but TG suffers and the PP gain is not consistent, so I have settled on just using 48 threads. Even if --numa distribute sometimes picks a hyperthread instead of a physical core, it usually leaves the physical sibling alone in that case. After the initial load, pinning with numactl respects the cores, but speeds are slightly lower or the same.

A lot of these tweaks are minor on their own, but I applied them all at once one day and gained a t/s or two. Individually they are often lost in the noise of the sweep bench.

BTW, llama-bench is segfaulting with deepseek for some reason.

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl -C 0-47 --interleave=all ./bin/llama-bench \
    -m /DeepSeek-V3-0324-UD-IQ2_XXS-00001-of-00005.gguf \
    -t 48 \
    --numa distribute \
    -ngl 62 \
    -ctk q8_0 \
    -ctv q8_0 \
    -mmp 0 \
    -mla 3 \
    -ub 4096 \
    -b 4096 \
    -amb 1024 \
    -mqkv 1 \
    -cuda offload-batch-size=0,fusion=1 \
    -ot "blk\.(6|7|8)\.ffn_.*(exps).=CUDA0" \
    -ot "blk\.(9|10|11|12)\.ffn_.*(exps).=CUDA1" \
    -ot "blk\.(13|14|15|16)\.ffn_.*(exps).=CUDA2" \
    -ot "blk\.(17|18|19|20)\.ffn_.*(exps).=CUDA3" \
    -ot "ffn_.*_exps.=CPU" \
    -p 32,64,128,256,512,1024,2048

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | type_k | type_v | mla |   amb | mmap | mqkv |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | ----: | ---: | ---: | ------------: | ---------------: |
Segmentation fault (core dumped)

@ikawrakow (Owner, Author)

Yes, on the CPU it may not bring any benefit. It is mostly for inference with full GPU offload, where the cost of kernel launches is not negligible compared to the kernel processing time (i.e., for not-too-large models).

But at least it looks like I haven't broken the graph building, which is good news.

@ikawrakow (Owner, Author)

BTW, llama-bench is segfaulting with deepseek for some reason.

Can you run

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl -C 0-47 --interleave=all gdb --args ./bin/llama-bench all_other_args here

and then type run when the gdb prompt comes up? When it crashes, type bt and post the output.

@Ph0rk0z commented Nov 12, 2025

| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | type_k | type_v | mla |   amb | mmap | mqkv |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | ----: | ---: | ---: | ------------: | ---------------: |
[New Thread 0x7fff9699c000 (LWP 97385)]
[New Thread 0x7fff9619b000 (LWP 97386)]
[New Thread 0x7fff7b3fd000 (LWP 97387)]
[New Thread 0x7fff7abfc000 (LWP 97388)]
[New Thread 0x7fff61fff000 (LWP 97389)]
[New Thread 0x7fff617fe000 (LWP 97390)]
[New Thread 0x7fff5adde000 (LWP 97391)]
[New Thread 0x7fff47fff000 (LWP 97392)]

Thread 1 "llama-bench" received signal SIGSEGV, Segmentation fault.
0x00007ffff7f3749b in llm_build_context::build_deepseek2() () from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
(gdb) bt
#0  0x00007ffff7f3749b in llm_build_context::build_deepseek2() () from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
#1  0x00007ffff7f40aa9 in llm_build_context::llama_build_graph(llama_context&, llama_batch const&, bool) ()
   from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
#2  0x00007ffff7e6d583 in llama_decode () from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
#3  0x000055555556cbc1 in test_prompt(llama_context*, int, int, int, int) [clone .constprop.0] ()
#4  0x00005555555656be in main ()
(gdb) 

@ikawrakow (Owner, Author)

Thanks!

Just to make sure: the crash is with this PR?

It crashes while building the graph. I'm not sure I understand why it works for @calvin2021y but crashes for you, and I understand even less why it crashes for you in llama-bench but not with llama-sweep-bench or llama-server.

@Ph0rk0z commented Nov 12, 2025

I noticed it after this PR, but I think it started a little earlier. I've been trying to build the chart from #910.

I did it successfully for GLM but not for DeepSeek.

@ikawrakow merged commit 9e2b21f into main on Nov 14, 2025